This report is a continuation of the first phase. The first phase was an overview of the data and its concepts, including a brief look at the distribution of the data at multiple levels; after that I met with microbiologists to become more familiar with the domain. In this phase, I explore the data and look for informative features. Several problems arise along the way, such as normalization and dimension reduction, and I test multiple approaches to study the data and to find factors responsible for the phenotypes.

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(highcharter)
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(stringr)
library(tidyr)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(Rtsne)
library(pheatmap)
library(limma)

library(coin)
## Loading required package: survival
library(destiny)

Loading the data

Rows are OTUs and columns are samples. There is also a taxonomy column that gives a more detailed description of each OTU; it can be extracted as a vector, providing a mapping between ‘#OTU ID’ and taxonomy.

microbeCounts <- read_tsv("./data/16s_taxonomic_profiles.tsv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `#OTU ID` = col_character(),
##   taxonomy = col_character()
## )
## See spec(...) for full column specifications.
microbeTaxonomy <- microbeCounts$taxonomy
microbeCounts %>% 
  select(-c(taxonomy)) -> microbeCounts

Now loading the metadata:

metadata <- read_csv("./data/hmp2_metadata.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   week_num = col_integer(),
##   interval_days = col_integer(),
##   visit_num = col_integer(),
##   total_reads = col_integer(),
##   IntervalSequence = col_integer(),
##   ProjectSpecificID = col_integer(),
##   `Age at diagnosis` = col_integer(),
##   `Number of flora tubes collected:` = col_integer(),
##   `Rectum Flora:` = col_integer(),
##   `Ileum flora:` = col_integer(),
##   `Other inflamed flora:` = col_integer(),
##   `Non-inflamed flora:` = col_integer(),
##   `Number of tubes collected for epithelial cell biopsies:` = col_integer(),
##   `Rectum cell biopsy:` = col_integer(),
##   `Ileum cell biopsy:` = col_integer(),
##   `Other non-inflamed cell biopsy:` = col_integer(),
##   `Number of DNA/RNA tubes collected:` = col_integer(),
##   consent_age = col_integer(),
##   `If no, in what year did you come to the United States?` = col_integer(),
##   `If yes, how many times?` = col_integer()
##   # ... with 24 more columns
## )
## See spec(...) for full column specifications.
length(colnames(metadata))
## [1] 489

There are plenty of factors that can be studied. I’m going to study the effect of multiple factors and their correlation with IBD, a disease described in more detail in the previous reports.

unique(metadata$diagnosis)
## [1] "CD"     "UC"     "nonIBD"

This disease has two subtypes, “CD” and “UC”, and I treat them as a single patient group here.

metadata %>% 
  filter(`External ID` %in% colnames(microbeCounts)) -> metadata
groups <- metadata %>% 
  select(`External ID` , diagnosis) %>% 
  mutate(isPatient = ifelse(diagnosis == "CD" | diagnosis == "UC" , 1 , 0)) %>% 
  select(`External ID` , isPatient)
sum(groups$isPatient == 1)
## [1] 133
sum(groups$isPatient == 0)
## [1] 45

Correlation heatmap

In order to see the correlation between samples, we can use a heatmap:

microbeCounts <- as.data.frame(microbeCounts)
rownames(microbeCounts) <- microbeCounts$`#OTU ID`
microbeCounts <- microbeCounts %>% 
  select(-c(`#OTU ID`))

pheatmap(microbeCounts)
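Note that the call above draws a heatmap of the raw count matrix rather than of correlations. A sketch of an actual sample-by-sample correlation heatmap (assuming Pearson correlation is acceptable for these counts):

```r
# Pairwise correlation between samples (the columns of the count matrix)
sampleCor <- cor(microbeCounts)
pheatmap(sampleCor, show_rownames = FALSE, show_colnames = FALSE)
```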

It seems that we should look at statistics other than correlation!

Dimension Reduction

As we have 982 features and 178 samples, we can project into lower dimensions to see whether the samples form clusters, and how good that clustering is.

PCA

In PCA we try to find linear combinations of the features, which here means linear combinations of microbes. Let’s see what happens:

pc <- prcomp(x = microbeCounts, center = F , scale. = F)

eigv = round(pc$sdev^2/sum(pc$sdev^2)*100, 2)
eigv = data.frame(c(1:178),eigv)
names(eigv) = c('PCs','Variance')
plot(eigv,type = "b",col = "darkblue",lwd = 2);grid()

We can visualize PCA in multiple ways to see its usefulness:

plot(summary(pc)$importance[3,], type="l",
     ylab="%variance explained", xlab="nth component (decreasing order)")
abline(h=0.99,col="red");abline(v = 32,col="red",lty=3)

So it seems that the first PCs contain less than \(80\%\) of the total variance. But we can see what the clusters look like in lower dimensions:

pcr <- data.frame(pc$rotation[, 1:3], group = groups$isPatient)
ggplot(pcr) + 
  geom_point(aes(x = PC1 , y = PC2, color = group)) + theme_bw()

It is not convincing in two dimensions. Now let’s visualize the PCA in three dimensions:

plot_ly(data = pcr , x = ~PC1 , y = ~PC2 , z = ~PC3 , color = ~group)
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

It is still not satisfying. Maybe non-linear methods would give better results. It is also worth trying centering and scaling to normalize the dataset before PCA:

pc <- prcomp(x = microbeCounts, center = T , scale. = T)
plot(summary(pc)$importance[3,], type="l",
     ylab="%variance explained", xlab="nth component (decreasing order)")
abline(h=0.99,col="red");abline(v = 32,col="red",lty=3)

pcr <- data.frame(pc$rotation[, 1:3], group = groups$isPatient)
ggplot(pcr) + 
  geom_point(aes(x = PC1 , y = PC2, color = group)) + theme_bw()

No!
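Before giving up on PCA, it may also be worth normalizing for library size: microbiome counts are compositional, and a centered log-ratio (CLR) transform is a common preprocessing step. A sketch, assuming a pseudocount of 1 to avoid log(0) on zero counts:

```r
# CLR-transform each sample (column), then run PCA on the samples
clrCounts <- apply(microbeCounts + 1, 2, function(x) log(x) - mean(log(x)))
pcClr <- prcomp(t(clrCounts), center = TRUE, scale. = FALSE)
pcrClr <- data.frame(pcClr$x[, 1:3], group = groups$isPatient)
ggplot(pcrClr) +
  geom_point(aes(x = PC1, y = PC2, color = group)) + theme_bw()
```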

tSNE

tSNE tries to preserve the structure of the data: it projects it so that distances between points in the high-dimensional space are approximately kept in the low-dimensional embedding. (This tSNE is run on the microbes, to see whether they have a specific structure or not.)

rtsneObj <- Rtsne::Rtsne(X = microbeCounts, dims = 2, perplexity = 20, check_duplicates = FALSE)
tsneCoords <- rtsneObj$Y
colnames(tsneCoords) <- c("X" , "Y")
ggplot(as.data.frame(tsneCoords)) + 
  geom_point(aes(x = X , y = Y)) + 
  ggtitle("tSNE on Microbes")

So let’s try tSNE on the samples and see whether they can be separated:

rtsneObj <- Rtsne::Rtsne(X = t(as.matrix(microbeCounts)), dims = 2, perplexity = 20, check_duplicates = FALSE,
                         pca_center = FALSE, pca_scale = FALSE)
tsneCoords <- as.data.frame(rtsneObj$Y)
colnames(tsneCoords) <- c("X" , "Y")
tsneCoords$group <- groups$isPatient
ggplot(as.data.frame(tsneCoords)) + 
  geom_point(aes(x = X , y = Y, color = group)) + 
  ggtitle("tSNE on Samples")

No, it does not! We can also try a lower perplexity:

rtsneObj <- Rtsne::Rtsne(X = t(as.matrix(microbeCounts)), dims = 2, perplexity = 5, check_duplicates = FALSE,
                         pca_center = FALSE, pca_scale = FALSE)
tsneCoords <- as.data.frame(rtsneObj$Y)
colnames(tsneCoords) <- c("X" , "Y")
tsneCoords$group <- groups$isPatient
ggplot(as.data.frame(tsneCoords)) + 
  geom_point(aes(x = X , y = Y, color = group)) + 
  ggtitle("tSNE on Samples (perplexity = 5)")

Maybe we can use other dimension reduction methods that are not linear.
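One such method is a diffusion map; the destiny package was loaded at the start but not used yet. A sketch with default parameters:

```r
# Diffusion map on the samples (rows of the transposed count matrix)
dm <- DiffusionMap(t(as.matrix(microbeCounts)))
dmCoords <- data.frame(DC1 = eigenvectors(dm)[, 1],
                       DC2 = eigenvectors(dm)[, 2],
                       group = groups$isPatient)
ggplot(dmCoords) +
  geom_point(aes(x = DC1, y = DC2, color = group)) + theme_bw()
```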

Linear models

Maybe we can model the phenotype as a simple linear function of the microbe counts. But since we have far more features than samples, we should first select a subset of features:

tmp <- microbeCounts
microbesSum <- rowSums(tmp)
tmp$sumOfCounts <- microbesSum
tmp %>% 
  filter(sumOfCounts > 470) -> tmp
tmp %>% 
  select(-c(sumOfCounts)) -> tmp
microbeCountsWithPheno <- as.data.frame(t(as.data.frame(tmp)))
microbeCountsWithPheno$isPatient <- groups$isPatient

# With only 178 samples, we keep only the most abundant microbes as predictors
fit <- lm(isPatient ~ . , data = microbeCountsWithPheno)
summary(fit)
## 
## Call:
## lm(formula = isPatient ~ ., data = microbeCountsWithPheno)
## 
## Residuals:
##     206646     224324     206619     224326     206624     219644 
## -9.295e-06 -2.216e-04  4.353e-04  1.971e-05 -1.132e-02  2.112e-04 
##     214995     215058     206750     206617     206626     219638 
##  3.986e-04 -2.210e-05 -2.717e-04 -3.256e-03 -3.021e-02 -9.628e-04 
##     224330     224328     206710     206676     206762     206635 
## -2.972e-04 -1.836e-03  4.465e-04  2.721e-04 -1.969e-03 -4.612e-05 
##     206702     206769     206534     215007     206678     206752 
##  2.056e-03 -6.464e-03  3.174e-04 -1.079e-04 -1.248e-03 -7.062e-04 
##     206675     206572     219652     219693     206738     215048 
## -2.580e-04  2.672e-04 -2.662e-04  3.923e-04 -1.480e-04 -1.887e-17 
##     215076     206720     206630     206758     219668     206669 
##  6.793e-04  3.011e-04 -4.515e-04 -5.972e-04  5.739e-03  4.870e-02 
##     206741     206659     215056     206684     206564     206732 
##  1.163e-04 -1.430e-04  5.902e-04 -5.778e-04  9.055e-04  9.144e-06 
##     206768     219654     206648     206561     206672     206747 
##  2.502e-03 -1.626e-04 -2.470e-04 -1.567e-04  7.886e-04  1.498e-03 
##     219637     206681     219633     206643     206671     206695 
##  4.783e-01 -1.125e-03  1.748e-04  6.629e-04 -1.390e-03  2.481e-04 
##     215010     215074     206766     206704     206682     206605 
##  3.025e-05  8.410e-04  1.060e-03  7.440e-04 -9.797e-03  7.037e-04 
##     206660     206620     215080     206756     219659     215009 
##  3.292e-05 -3.830e-02  3.708e-04 -2.224e-04  1.365e-03 -4.034e-05 
##     219676     206770     206656     206645     206730     206700 
##  9.435e-05  7.611e-05 -6.047e-04  2.592e-03  1.862e-03  3.300e-06 
##     215084     206713     206725     219651     206729     206734 
##  1.913e-03 -3.231e-04 -3.467e-04 -4.969e-05 -1.419e-04 -2.296e-04 
##     206727     219646     215004     206629     206614     206719 
##  9.321e-05 -8.042e-04 -1.180e-02 -1.691e-04 -2.110e-03  2.220e-03 
##     206547     206724     206622     206603     206563     206739 
## -2.983e-05 -1.194e-05  6.335e-04  8.906e-04 -5.654e-18 -1.240e-04 
##     206636     215023     206743     224845     219634     224327 
## -7.339e-03 -6.884e-06 -2.875e-04  1.820e-04 -2.133e-04  2.684e-03 
##     206623     219692     206658     224325     206742     206618 
##  1.348e-02 -4.767e-02  9.031e-06 -2.300e-05  1.294e-04 -4.471e-04 
##     206609     206708     206647     219645     215050     224323 
## -8.849e-04 -9.518e-04  7.939e-05  5.733e-04  1.506e-05 -1.045e-05 
##     206644     206755     206677     206711     206703     206767 
## -1.462e-03  1.107e-03 -1.746e-03  2.431e-05 -1.241e-03 -3.273e-03 
##     215077     219657     206548     219669     219643     206621 
## -3.979e-04  3.924e-05  1.042e-04  1.638e-04  4.253e-04 -1.315e-01 
##     206616     206746     206733     206751     219653     206761 
##  2.510e-02 -1.043e-03  3.261e-04  4.203e-04  2.091e-04  2.655e-03 
##     215062     219670     206723     206571     219655     222171 
## -2.807e-04 -2.572e-04  2.017e-02 -2.135e-03 -5.771e-05 -2.362e-03 
##     215057     206753     206721     215075     215008     206668 
##  5.929e-06  7.010e-04  8.610e-05 -5.738e-04 -3.133e-03  1.212e-03 
##     222170     206625     206740     206731     219658     206562 
##  5.703e-03 -4.153e-03 -4.332e-04 -9.567e-06  7.130e-04  8.553e-05 
##     206757     206569     206673     224844     206627     206667 
##  5.838e-04 -9.690e-05  6.812e-04  3.747e-04 -3.023e-01 -1.058e-05 
##     206536     214994     206538     215085     215067     215061 
## -1.875e-04 -3.904e-04 -7.123e-04 -1.272e-03 -1.971e-06  1.605e-04 
##     206726     206570     215055     215049     206615     206712 
##  3.788e-04  8.591e-05 -1.732e-03  4.747e-03  2.435e-04  1.068e-03 
##     219656     206683     206709     206608     219691     206701 
##  1.350e-04  9.079e-04  3.987e-04  4.723e-04  2.376e-04 -4.911e-03 
##     206604     215003     206628     219675     206728     206754 
## -1.159e-03  1.905e-03  7.612e-03  4.531e-04 -2.634e-04  1.169e-04 
##     206718     206657     206670     206655 
## -3.262e-03 -4.341e-06  3.950e-05 -7.407e-04 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.047e-01  3.481e-01   1.450    0.384
## V1           1.872e-02  1.985e-02   0.943    0.519
## V2           4.082e-02  7.968e-01   0.051    0.967
## V3          -5.224e+00  8.739e+00  -0.598    0.657
## V4          -7.357e-02  1.008e-01  -0.730    0.599
## V5          -2.501e-02  2.295e-02  -1.090    0.473
## V6          -1.499e-02  3.004e-02  -0.499    0.705
## V7           6.783e-01  8.077e-01   0.840    0.555
## V8           2.201e-01  4.818e-01   0.457    0.727
## V9          -6.887e-02  1.471e-01  -0.468    0.721
## V10          2.948e-01  6.852e-01   0.430    0.741
## V11          1.833e-03  1.582e-03   1.158    0.453
## V12          2.676e-02  1.666e-01   0.161    0.899
## V13          9.918e-02  1.765e-01   0.562    0.674
## V14          3.192e-02  4.487e-02   0.711    0.606
## V15          1.050e-01  1.713e-01   0.613    0.650
## V16          2.615e-02  9.702e-02   0.270    0.832
## V17          2.364e-01  3.575e-01   0.661    0.628
## V18         -3.628e-02  1.644e-01  -0.221    0.862
## V19         -7.510e-02  1.008e-01  -0.745    0.592
## V20         -1.122e-01  1.620e-01  -0.693    0.614
## V21         -1.551e-01  2.398e-01  -0.647    0.634
## V22          6.091e-03  1.739e-02   0.350    0.786
## V23         -3.286e-01  7.537e-01  -0.436    0.738
## V24         -6.455e-01  1.287e+00  -0.502    0.704
## V25          2.150e-02  4.400e-01   0.049    0.969
## V26          2.996e+00  4.691e+00   0.639    0.638
## V27         -5.782e-02  2.646e-01  -0.219    0.863
## V28         -3.728e-01  4.428e-01  -0.842    0.554
## V29          1.351e-01  2.530e-01   0.534    0.688
## V30          1.142e-01  7.585e-01   0.151    0.905
## V31          2.061e-02  3.952e-02   0.521    0.694
## V32          5.053e-02  5.604e-02   0.902    0.533
## V33          3.566e-01  4.017e-01   0.888    0.538
## V34          1.091e-01  2.406e-01   0.453    0.729
## V35         -2.997e-01  3.769e-01  -0.795    0.572
## V36          6.138e-02  1.358e-01   0.452    0.730
## V37         -9.670e-03  4.043e-02  -0.239    0.851
## V38         -1.534e-01  1.732e-01  -0.885    0.539
## V39          1.286e-01  2.366e-01   0.544    0.683
## V40          7.960e-01  1.457e+00   0.546    0.682
## V41         -3.210e-02  3.606e-02  -0.890    0.537
## V42         -1.344e-02  1.306e-02  -1.029    0.491
## V43         -4.239e-02  3.397e-01  -0.125    0.921
## V44         -4.766e-02  7.042e-02  -0.677    0.621
## V45          1.881e-01  5.549e-01   0.339    0.792
## V46         -1.170e-04  3.853e-02  -0.003    0.998
## V47          1.831e-02  2.580e-02   0.710    0.607
## V48          4.621e-02  9.368e-02   0.493    0.708
## V49          7.253e-03  1.044e-02   0.695    0.613
## V50         -2.468e+00  6.081e+00  -0.406    0.755
## V51         -3.890e-02  1.555e-01  -0.250    0.844
## V52          2.619e-01  3.212e-01   0.815    0.565
## V53          1.749e-02  1.068e+00   0.016    0.990
## V54          1.102e-01  1.479e-01   0.745    0.592
## V55         -7.667e-02  8.546e-02  -0.897    0.535
## V56         -1.912e+00  3.084e+00  -0.620    0.647
## V57          8.911e-02  1.194e-01   0.746    0.592
## V58         -3.560e-03  4.892e-03  -0.728    0.600
## V59          9.550e-01  3.599e+00   0.265    0.835
## V60          5.520e-01  1.182e+00   0.467    0.722
## V61         -1.892e-01  2.798e-01  -0.676    0.621
## V62          6.184e-01  9.854e-01   0.628    0.643
## V63         -3.668e-02  4.788e-02  -0.766    0.584
## V64          1.778e-01  3.500e-01   0.508    0.701
## V65          1.421e-01  3.848e-01   0.369    0.775
## V66         -6.777e-02  8.570e-02  -0.791    0.574
## V67         -1.957e-02  8.165e-01  -0.024    0.985
## V68          5.457e-03  2.980e-01   0.018    0.988
## V69          2.952e-03  7.162e-03   0.412    0.751
## V70          5.009e-04  2.166e-02   0.023    0.985
## V71          4.696e-03  2.820e-01   0.017    0.989
## V72         -3.725e-01  4.964e-01  -0.750    0.590
## V73          9.977e-01  1.240e+00   0.805    0.569
## V74         -9.111e-02  1.517e-01  -0.600    0.656
## V75          1.719e+00  2.685e+00   0.640    0.638
## V76         -3.252e-02  3.645e-01  -0.089    0.943
## V77          3.511e-03  7.013e-02   0.050    0.968
## V78          2.287e-03  1.639e-02   0.140    0.912
## V79         -6.067e-02  1.273e-01  -0.477    0.717
## V80         -4.511e-02  5.626e-02  -0.802    0.570
## V81         -8.700e-03  2.764e-02  -0.315    0.806
## V82          4.572e-02  5.401e-02   0.846    0.553
## V83          2.708e-01  2.879e-01   0.941    0.519
## V84         -2.621e-02  1.039e-01  -0.252    0.843
## V85         -3.246e-01  4.105e-01  -0.791    0.574
## V86          6.362e-03  5.708e-01   0.011    0.993
## V87          7.064e-02  3.050e-01   0.232    0.855
## V88         -2.317e-01  5.884e-01  -0.394    0.761
## V89         -5.079e-01  7.192e-01  -0.706    0.609
## V90          1.845e-01  4.836e-01   0.381    0.768
## V91         -8.945e-01  1.227e+00  -0.729    0.599
## V92          4.336e-01  7.239e-01   0.599    0.656
## V93          1.750e-01  2.971e-01   0.589    0.661
## V94          2.183e-02  3.176e-02   0.687    0.617
## V95         -1.067e-02  1.574e-02  -0.678    0.621
## V96         -3.256e-01  4.040e-01  -0.806    0.568
## V97         -1.167e+00  1.646e+00  -0.709    0.607
## V98         -2.792e-03  7.006e-03  -0.399    0.759
## V99          2.471e-01  4.610e-01   0.536    0.687
## V100         1.618e-02  3.220e-02   0.502    0.704
## V101        -8.011e-02  1.541e+00  -0.052    0.967
## V102         7.707e-01  8.468e-01   0.910    0.530
## V103         5.956e-02  2.762e-01   0.216    0.865
## V104        -1.392e+00  1.892e+00  -0.736    0.596
## V105        -3.489e-02  5.602e-02  -0.623    0.645
## V106        -3.612e-01  2.916e+00  -0.124    0.922
## V107         6.630e-01  3.015e+00   0.220    0.862
## V108         3.017e-01  5.299e-01   0.569    0.670
## V109        -1.056e+00  3.018e+00  -0.350    0.786
## V110        -2.554e-01  6.818e-01  -0.375    0.772
## V111        -2.827e-02  1.141e-01  -0.248    0.845
## V112         2.040e-02  6.423e-02   0.318    0.804
## V113        -1.579e-01  7.750e-01  -0.204    0.872
## V114        -4.813e-02  1.646e-01  -0.292    0.819
## V115        -5.676e-02  7.260e-02  -0.782    0.578
## V116        -2.798e-03  2.036e-02  -0.137    0.913
## V117        -1.489e-01  1.849e-01  -0.805    0.568
## V118        -8.592e-02  2.185e-01  -0.393    0.761
## V119        -1.341e-01  3.237e-01  -0.414    0.750
## V120        -2.066e-01  9.665e-01  -0.214    0.866
## V121        -6.144e-02  5.477e-02  -1.122    0.463
## V122        -3.145e-01  5.334e-01  -0.590    0.661
## V123        -4.011e-02  7.234e-02  -0.554    0.678
## V124        -2.170e-02  1.109e-01  -0.196    0.877
## V125        -5.747e-02  6.728e-02  -0.854    0.550
## V126         2.517e-01  2.935e-01   0.857    0.549
## V127        -3.679e-03  6.286e-01  -0.006    0.996
## V128        -1.209e-01  1.485e+00  -0.081    0.948
## V129         1.033e+00  1.081e+00   0.956    0.514
## V130        -1.963e-01  2.333e-01  -0.841    0.555
## V131         5.253e-02  2.944e-01   0.178    0.888
## V132         3.422e-01  5.837e-01   0.586    0.662
## V133         2.312e-02  1.390e-01   0.166    0.895
## V134         8.180e-02  1.079e-01   0.758    0.587
## V135         1.158e-03  1.361e-02   0.085    0.946
## V136        -2.231e-02  1.407e-01  -0.159    0.900
## V137        -5.329e-02  1.267e-01  -0.421    0.746
## V138         9.113e-02  3.293e-01   0.277    0.828
## V139         8.203e-02  2.677e-01   0.306    0.811
## V140        -2.901e-03  2.023e-01  -0.014    0.991
## V141         5.421e-02  1.371e-01   0.395    0.760
## V142        -4.619e-02  4.777e-01  -0.097    0.939
## V143         1.224e-02  9.505e-02   0.129    0.918
## V144        -1.097e-01  1.265e-01  -0.868    0.545
## V145         2.968e-01  4.063e-01   0.731    0.598
## V146         6.652e-01  8.713e-01   0.764    0.585
## V147         1.342e-01  1.402e-01   0.957    0.514
## V148         1.462e-02  1.079e-01   0.135    0.914
## V149        -2.548e-01  1.572e+00  -0.162    0.898
## V150        -2.229e-01  2.544e-01  -0.876    0.542
## V151        -7.650e-03  1.779e-02  -0.430    0.741
## V152        -4.233e-01  5.932e-01  -0.714    0.605
## V153        -3.781e-05  3.592e-04  -0.105    0.933
## V154        -3.968e-01  4.256e-01  -0.932    0.522
## V155         2.090e-01  4.801e-01   0.435    0.739
## V156         1.706e-02  2.743e-02   0.622    0.646
## V157         4.356e-02  8.333e-02   0.523    0.693
## V158         1.101e-01  4.528e-01   0.243    0.848
## V159         4.801e-01  7.219e-01   0.665    0.626
## V160        -4.757e-01  1.641e+00  -0.290    0.820
## V161         5.233e-01  8.757e-01   0.598    0.657
## V162         3.020e-03  5.066e-03   0.596    0.658
## V163         5.783e-03  1.185e-01   0.049    0.969
## V164        -1.931e-02  3.056e-02  -0.632    0.641
## V165        -6.805e-02  1.299e-01  -0.524    0.693
## V166         3.119e-01  7.351e-01   0.424    0.745
## V167        -6.166e-01  8.672e-01  -0.711    0.607
## V168        -2.348e-02  3.896e-01  -0.060    0.962
## V169         1.105e-01  2.096e-01   0.527    0.691
## V170        -9.663e-02  2.449e-01  -0.395    0.761
## V171        -5.896e-02  8.232e-02  -0.716    0.604
## V172         1.682e-01  3.031e-01   0.555    0.677
## V173        -3.735e+00  3.061e+00  -1.220    0.437
## V174        -5.518e-02  2.347e-01  -0.235    0.853
## V175         1.447e-01  9.161e-01   0.158    0.900
## V176         1.739e-02  7.549e-02   0.230    0.856
## 
## Residual standard error: 0.5887 on 1 degrees of freedom
## Multiple R-squared:  0.9897, Adjusted R-squared:  -0.8242 
## F-statistic: 0.5456 on 176 and 1 DF,  p-value: 0.8225

In this model, I tried to fit the phenotype to the microbes with high counts, but with 176 predictors and 178 samples there is only 1 residual degree of freedom, so the fit is meaningless: no coefficient is significant despite the near-perfect R-squared.
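Rather than one regression with 176 predictors, a per-OTU approach may be more informative. A sketch using limma (loaded above), assuming log-transformed counts and that the row order of groups matches the column order of microbeCounts, as the code above already assumes:

```r
# One moderated linear model per OTU instead of one huge joint regression
design <- model.matrix(~ isPatient, data = groups)
logCounts <- log2(as.matrix(microbeCounts) + 1)   # pseudocount of 1
limmaFit <- eBayes(lmFit(logCounts, design))
topTable(limmaFit, coef = "isPatient", number = 10)  # top differential OTUs
```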

Generalized linear model

We can also try a logistic (binomial) model, since the phenotype is binary:

fit <- glm(isPatient ~ . , data = microbeCountsWithPheno , family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit)
## 
## Call:
## glm(formula = isPatient ~ ., family = "binomial", data = microbeCountsWithPheno)
## 
## Deviance Residuals: 
##     206646      224324      206619      224326      206624      219644  
##  2.408e-06   2.404e-06   2.417e-06   2.409e-06   2.186e-06   2.413e-06  
##     214995      215058      206750      206617      206626      219638  
##  2.416e-06   2.408e-06   2.403e-06   2.345e-06   1.802e-06   2.390e-06  
##     224330      224328      206710      206676      206762      206635  
##  2.403e-06   2.373e-06   2.417e-06   2.414e-06   2.370e-06   2.408e-06  
##     206702      206769      206534      215007      206678      206752  
##  2.448e-06   2.282e-06   2.415e-06   2.407e-06   2.384e-06   2.395e-06  
##     206675      206572      219652      219693      206738      215048  
##  2.404e-06   2.414e-06   2.403e-06  -2.401e-06  -2.412e-06   2.409e-06  
##     215076      206720      206630      206758      219668      206669  
##  2.422e-06   2.414e-06   2.400e-06   2.397e-06   2.519e-06   3.302e-06  
##     206741      206659      215056      206684      206564      206732  
##  2.411e-06   2.406e-06   2.420e-06   2.397e-06   2.426e-06   2.409e-06  
##     206768      219654      206648      206561      206672      206747  
##  2.457e-06   2.406e-06   2.404e-06   2.406e-06   2.424e-06   2.438e-06  
##     219637      206681      219633      206643      206671      206695  
##  7.832e-06   2.387e-06   2.412e-06   2.422e-06   2.382e-06   2.413e-06  
##     215010      215074      206766      206704      206682      206605  
##  2.409e-06   2.425e-06   2.429e-06  -2.394e-06  -2.597e-06  -2.395e-06  
##     206660      206620      215080      206756      219659      215009  
## -2.408e-06  -3.120e-06  -2.401e-06   2.404e-06   2.435e-06  -2.409e-06  
##     219676      206770      206656      206645      206730      206700  
##  2.411e-06   2.410e-06   2.397e-06   2.459e-06  -2.372e-06  -2.409e-06  
##     215084      206713      206725      219651      206729      206734  
##  2.446e-06   2.402e-06  -2.415e-06  -2.410e-06   2.406e-06   2.404e-06  
##     206727      219646      215004      206629      206614      206719  
##  2.410e-06   2.393e-06   2.176e-06  -2.412e-06  -2.450e-06  -2.365e-06  
##     206547      206724      206622      206603      206563      206739  
## -2.409e-06  -2.409e-06  -2.396e-06   2.426e-06   2.409e-06   2.406e-06  
##     206636      215023      206743      224845      219634      224327  
##  2.265e-06   2.408e-06   2.403e-06   2.412e-06   2.405e-06   2.460e-06  
##     206623      219692      206658      224325      206742      206618  
##  2.666e-06   1.433e-06   2.409e-06   2.408e-06   2.411e-06   2.400e-06  
##     206609      206708      206647      219645      215050      224323  
##  2.391e-06  -2.427e-06  -2.407e-06   2.420e-06   2.409e-06   2.408e-06  
##     206644      206755      206677      206711      206703      206767  
##  2.380e-06   2.430e-06   2.375e-06   2.409e-06   2.385e-06   2.345e-06  
##     215077      219657      206548      219669      219643      206621  
##  2.401e-06   2.409e-06   2.411e-06   2.412e-06   2.417e-06   2.110e-08  
##     206616      206746      206733      206751      219653      206761  
##  2.882e-06   2.388e-06   2.415e-06   2.417e-06   2.413e-06   2.460e-06  
##     215062      219670      206723      206571      219655      222171  
##  2.403e-06   2.404e-06   2.791e-06   2.367e-06  -2.410e-06  -2.454e-06  
##     215057      206753      206721      215075      215008      206668  
## -2.409e-06  -2.395e-06  -2.407e-06  -2.420e-06  -2.469e-06  -2.385e-06  
##     222170      206625      206740      206731      219658      206562  
## -2.297e-06  -2.489e-06  -2.417e-06  -2.409e-06   2.422e-06   2.410e-06  
##     206757      206569      206673      224844      206627      206667  
##  2.420e-06   2.407e-06   2.422e-06  -2.401e-06  -6.551e-06  -2.409e-06  
##     206536      214994      206538      215085      215067      215061  
## -2.412e-06  -2.416e-06  -2.422e-06   2.384e-06   2.409e-06   2.412e-06  
##     206726      206570      215055      215049      206615      206712  
##  2.416e-06   2.410e-06   2.375e-06   2.500e-06   2.413e-06  -2.388e-06  
##     219656      206683      206709      206608      219691      206701  
## -2.406e-06   2.426e-06   2.416e-06   2.418e-06   2.413e-06  -2.503e-06  
##     206604      215003      206628      219675      206728      206754  
## -2.431e-06  -2.372e-06  -2.260e-06   2.417e-06   2.403e-06   2.411e-06  
##     206718      206657      206670      206655  
##  2.345e-06   2.409e-06   2.409e-06   2.394e-06  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.844e+01  9.692e+04       0        1
## V1           2.041e+00  8.350e+03       0        1
## V2          -7.994e+01  4.744e+05       0        1
## V3          -1.263e+03  5.185e+06       0        1
## V4          -5.777e+00  4.603e+04       0        1
## V5          -1.852e+00  1.380e+04       0        1
## V6          -1.366e+00  2.367e+04       0        1
## V7           3.222e+01  5.158e+05       0        1
## V8           7.517e+01  3.015e+05       0        1
## V9           2.445e+00  7.433e+04       0        1
## V10          5.108e+01  3.468e+05       0        1
## V11          9.536e-02  9.843e+02       0        1
## V12         -1.634e+01  1.066e+05       0        1
## V13         -3.266e+00  9.514e+04       0        1
## V14          1.645e+00  1.961e+04       0        1
## V15          5.982e+00  7.572e+04       0        1
## V16          3.125e-01  5.285e+04       0        1
## V17          3.209e+01  1.567e+05       0        1
## V18          1.012e+01  8.913e+04       0        1
## V19         -2.788e+00  5.055e+04       0        1
## V20         -1.038e+01  6.792e+04       0        1
## V21         -2.049e+01  1.099e+05       0        1
## V22         -9.323e-01  9.902e+03       0        1
## V23         -1.214e+02  4.966e+05       0        1
## V24         -6.496e+01  5.961e+05       0        1
## V25         -5.246e+01  2.902e+05       0        1
## V26          5.787e+02  2.466e+06       0        1
## V27          2.506e+01  1.765e+05       0        1
## V28         -2.833e+01  1.812e+05       0        1
## V29          1.493e+01  1.386e+05       0        1
## V30          1.102e+02  5.454e+05       0        1
## V31         -1.608e+00  2.421e+04       0        1
## V32          5.565e+00  3.528e+04       0        1
## V33          3.817e+01  1.686e+05       0        1
## V34          2.335e+01  1.283e+05       0        1
## V35         -1.554e+01  1.667e+05       0        1
## V36          1.583e+01  7.057e+04       0        1
## V37         -6.356e+00  3.609e+04       0        1
## V38         -1.628e+01  1.116e+05       0        1
## V39         -5.504e+00  1.212e+05       0        1
## V40         -3.270e+01  8.895e+05       0        1
## V41         -7.148e-01  2.062e+04       0        1
## V42         -1.759e+00  8.735e+03       0        1
## V43          2.922e+01  1.994e+05       0        1
## V44         -7.580e+00  3.124e+04       0        1
## V45          1.608e+01  2.950e+05       0        1
## V46          2.744e+00  2.147e+04       0        1
## V47          1.977e+00  1.761e+04       0        1
## [... coefficients for V48 through V176 omitted: every standard error
##  is orders of magnitude larger than its estimate, every z value is 0
##  and every p-value is 1 ...]
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.0128e+02  on 177  degrees of freedom
## Residual deviance: 1.1260e-09  on   1  degrees of freedom
## AIC: 354
## 
## Number of Fisher Scoring iterations: 25

The fit is useless, so we can rule out a plain linear model for two reasons: 1- linear models need more samples than features; I tried to mitigate this by keeping only high-count microbes, but that was not enough. 2- The fitted model itself is degenerate: the standard errors are enormous, every p-value is 1, and the residual deviance collapses to essentially zero with only one residual degree of freedom, a textbook sign of overfitting.
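This degeneracy is easy to reproduce on simulated data (a hypothetical sketch, not the real counts): with more predictors than observations, logistic regression can fit the training labels perfectly, so the residual deviance collapses toward zero no matter how meaningless the predictors are.

```r
# Hypothetical illustration of the p > n problem: 20 samples, 30 random
# predictors, random binary labels. glm() still drives the residual
# deviance to ~0 because the design matrix spans the whole sample space.
set.seed(1)
n <- 20; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)
fit <- suppressWarnings(glm(y ~ X, family = "binomial"))
fit$deviance      # essentially zero, as in the model above
fit$df.residual   # no residual degrees of freedom left
```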

Normalization

boxplot(microbeCounts)

A possible approach is to normalize counts with respect to samples. Maybe counts are higher for samples. I propose a normalization technique with l2 norm. It is as follows:

NormalizeL2Sqr <- function(iData){
  sqrData <- as.data.frame(sqrt(iData))
  
  # The l2 norm of each square-rooted column equals the square root of
  # that column's total count.
  sampleSum <- sqrt(colSums(iData , na.rm = T))
  
  zeroSamples <- which(sampleSum == 0)
  if(length(zeroSamples) != 0){
    sqrData <- sqrData[ , -zeroSamples] # removing samples whose total count is zero
    sampleSum <- sampleSum[-zeroSamples]
    warning("Some samples have zero total counts and were removed")
  }
  
  # TODO: decide how to handle microbes with zero counts in every sample
  
  if(ncol(sqrData) == 0){
    stop("Every microbe count is zero in every sample") 
  }
  
  # divide each column by its norm; sweep() is vectorized and much
  # faster than apply() over the columns
  sweep(sqrData , 2 , sampleSum , "/")
}
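Before applying it to the real data, a quick sanity check on a toy count table (made-up values): each normalized column should end up with unit l2 norm. The computation is inlined here so the check stands on its own.

```r
# Toy 3x2 count table: after the square-root transform and division by
# the column norm, the sum of squares of each column should be 1.
toy <- data.frame(s1 = c(4, 9, 0), s2 = c(1, 0, 16))
toyNorm <- sweep(sqrt(toy), 2, sqrt(colSums(toy)), "/")
colSums(toyNorm^2)  # s1 and s2 are both 1 (up to floating point)
```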

Each count is square-root transformed and then divided by the column's l2 norm: \[ \frac{\sqrt{x_i}}{\sqrt{\sum_j x_j}} \] (note that \(\sqrt{\sum_j x_j}\) is exactly the l2 norm of the square-rooted column). Now using this method on the data:

microbeCountsNormalized <- NormalizeL2Sqr(microbeCounts)
pheatmap(microbeCountsNormalized)

Its heatmap looks better than before!

Now let's run PCA on the samples again and see whether the two groups separate:

pca <- prcomp(microbeCountsNormalized , center = F , scale. = F)
# samples are the columns (variables) here, so plot their loadings
# from the rotation matrix (pca$r partial-matches pca$rotation)
pcr <- data.frame(pca$rotation[,1:3] , group = groups$isPatient)
ggplot(pcr) + 
  geom_point(aes(x = PC1 , y = PC2, color = group)) + theme_bw()

No! Let’s use tSNE:

rtsneObj <- Rtsne::Rtsne(X = t(as.matrix(microbeCountsNormalized)) , dim = 2 , perplexity = 5 , check_duplicates = F, 
                         pca_center = F , pca_scale = F)
tsneCoords <- as.data.frame(rtsneObj$Y)
colnames(tsneCoords) <- c("X" , "Y")
tsneCoords$group <- groups$isPatient
ggplot(as.data.frame(tsneCoords)) + 
  geom_point(aes(x = X , y = Y, color = group)) + 
  ggtitle("tSNE on Microbes")

The separation is still unsatisfying; a simple k-means on this embedding would mislabel many samples.

Multiple hypothesis testing

We’ve tested multiple methods but there is anyway a goodway called multiple hypothesis testing to see the differences among the groups. We can We can implement our usual t-test:

tTest <- function(x1 , x2){
  x1.mean <- mean(x1)
  x2.mean <- mean(x2)
  
  x1.sd <- sd(x1)
  x2.sd <- sd(x2)
  
  df <- length(x1) + length(x2) - 2
  
  # if the variances are somehow equal, pooled variance will be:
  sps = ((length(x1) - 1) * x1.sd^2 + (length(x2) - 1) * x2.sd^2) / df
  
  tstat= abs(x1.mean - x2.mean) / (sqrt(sps) * sqrt(1/length(x1) + 1/length(x2)))
  
  pt(tstat , df , lower.tail = F) * 2
  # t.test in R is two sided
}
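As a sanity check (on hypothetical simulated data), the pooled-variance computation used above should reproduce R's built-in `t.test` with `var.equal = TRUE`; the formula is inlined here so the check is self-contained.

```r
# Compare the hand-rolled pooled-variance two-sided p-value against
# stats::t.test(var.equal = TRUE) on simulated data.
set.seed(42)
a <- rnorm(30); b <- rnorm(25, mean = 1)
df <- length(a) + length(b) - 2
sp2 <- ((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) / df
tstat <- abs(mean(a) - mean(b)) / sqrt(sp2 * (1/length(a) + 1/length(b)))
p_manual <- 2 * pt(tstat, df, lower.tail = FALSE)
p_base <- t.test(a, b, var.equal = TRUE)$p.value
all.equal(p_manual, p_base)  # TRUE
```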
groups %>% 
  filter(isPatient == 1) -> tmp
microbeCountsPatients <- microbeCounts[,tmp$`External ID`]
patientsColumn <- dim(microbeCountsPatients)[2]
groups %>% 
  filter(isPatient == 0) -> tmp
microbeCountsHealthy <- microbeCounts[,tmp$`External ID`]

orderedDataSet <- cbind(microbeCountsPatients , microbeCountsHealthy)

tTestDEA <- apply(orderedDataSet , 1 , function(i) {
    x1 <- i[1:patientsColumn]
    x2 <- i[(patientsColumn + 1):ncol(orderedDataSet)]
    tTest(x1 =x1 , x2 = x2)
})
 
DEATtest <- data.frame(MicrobeName = names(tTestDEA) , P_value = tTestDEA , stringsAsFactors = F)
DEATtest <- subset(DEATtest , P_value < 0.05)
dim(DEATtest)
## [1] 112   2

A simple per-microbe test selects 112 microbes at the unadjusted 0.05 level. But with this many simultaneous tests, correcting the p-values is essential. As a first rough sketch I apply a Bonferroni-style cutoff (note: a strict Bonferroni correction would divide 0.05 by the total number of tests, not by the 112 already-selected ones, so the threshold below is on the lenient side):

DEATtestB <- subset(DEATtest , P_value < (0.05 / dim(DEATtest)[1]))
dim(DEATtestB)
## [1] 5 2
DEATtestB$MicrobeName
## [1] "Unc03y4v" "Unc36622" "Unc81447" "Unc38399" "Unc00y95"

Only five microbes remained. We can also try FDR correction:

# recompute the full p-value table (DEATtest was subset above)
tTestDEA <- apply(orderedDataSet , 1 , function(i) {
    x1 <- i[1:patientsColumn]
    x2 <- i[(patientsColumn + 1):ncol(orderedDataSet)]
    tTest(x1 =x1 , x2 = x2)
})
 
DEATtest <- data.frame(MicrobeName = names(tTestDEA) , P_value = tTestDEA , stringsAsFactors = F)
adjustedPValues <- p.adjust(DEATtest$P_value , method = "fdr")
DEATtestFDR <- DEATtest
DEATtestFDR$P_value <- adjustedPValues
subset(DEATtestFDR , P_value < 0.05)
##          MicrobeName      P_value
## Unc03y4v    Unc03y4v 0.0072747433
## Unc36622    Unc36622 0.0011250931
## Unc81447    Unc81447 0.0299726689
## Unc38399    Unc38399 0.0271134489
## Unc00y95    Unc00y95 0.0009854847
selectedMicrobes.ttest <- DEATtestFDR %>% 
  filter(P_value < 0.05)
selectedMicrobes.ttest <- selectedMicrobes.ttest$MicrobeName

Again only five microbes remain, and they are exactly the ones selected by the Bonferroni cutoff:

intersect(DEATtestB$MicrobeName , selectedMicrobes.ttest)
## [1] "Unc03y4v" "Unc36622" "Unc81447" "Unc38399" "Unc00y95"

Permutation test

We can also replace the t-test with a permutation test, which drops the normality assumption:

# total count per microbe, used to drop microbes that are zero everywhere
orderedDataSet$microbeCounts <- rowSums(orderedDataSet)
# dplyr::filter() drops rownames, so save them now and restore them after
tmpRowNames <- rownames(orderedDataSet)
zeroMicrobes <- rownames(orderedDataSet[which(orderedDataSet$microbeCounts == 0),])

orderedDataSet %>% 
  filter(microbeCounts != 0) -> orderedDataSet
dim(orderedDataSet)
## [1] 959 179
orderedDataSet %>% 
  select(-c(microbeCounts)) -> orderedDataSet
tmpRowNames <- setdiff(tmpRowNames , zeroMicrobes)
rownames(orderedDataSet) <- tmpRowNames

permTest <- apply(orderedDataSet , 1 , function(i) {
    x1 <- i[1:patientsColumn]
    x2 <- i[(patientsColumn + 1):ncol(orderedDataSet)]

    x1df <- data.frame(class = "patient" , counts = x1)
    x2df <- data.frame(class = "healthy" , counts = x2)
    tmp <- rbind(x1df , x2df)
    coin::oneway_test(counts ~ class , data = tmp , distribution = approximate(B=1000))
})
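For intuition about what `coin::oneway_test` with `distribution = approximate(...)` is doing, here is a minimal hand-rolled version on hypothetical data: shuffle the group labels B times and ask how often the shuffled difference in means is at least as extreme as the observed one.

```r
# Minimal label-shuffling permutation test on made-up count data.
set.seed(7)
x1 <- rpois(20, 4)    # hypothetical "patient" counts
x2 <- rpois(20, 10)   # hypothetical "healthy" counts
obs <- abs(mean(x1) - mean(x2))
pooled <- c(x1, x2)
B <- 1000
perm <- replicate(B, {
  idx <- sample(length(pooled), length(x1))  # random relabeling
  abs(mean(pooled[idx]) - mean(pooled[-idx]))
})
mean(perm >= obs)  # approximate two-sided p-value; tiny for these groups
```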

DEAPermTest <- data.frame()
for(i in 1:length(permTest)){
  DEAPermTest <- rbind(DEAPermTest , data.frame(MicrobeName = rownames(orderedDataSet)[i], P_value = coin::pvalue(permTest[[i]])[1]))
}
head(DEAPermTest)
##   MicrobeName P_value
## 1    IP8BSoli   0.825
## 2    UncTepi3   0.678
## 3    Unc004ii   0.823
## 4    Unc00re8   0.582
## 5    Unc018j2   0.344
## 6    Unc04u81   0.944
adjustedPValues <- p.adjust(DEAPermTest$P_value , method = "fdr")
DEAPermTestFDR <- DEAPermTest
DEAPermTestFDR$P_value <- adjustedPValues
subset(DEAPermTestFDR , P_value < 0.05)
##     MicrobeName P_value
## 118    Unc03exo       0
## 200    Unc02ruj       0
## 463    Unc03uuu       0
## 524    Unc01pk4       0
## 565    Unc05n7t       0
## 615    Unc03tc5       0
## 683    Unc03y4v       0
## 732    Unc36622       0
## 790    Unc38399       0
## 804    Unc01zba       0
## 901    Unc00y95       0
selectedMicrobes <- DEAPermTestFDR %>% 
  filter(P_value < 0.05)
selectedMicrobes.perm <- selectedMicrobes$MicrobeName

Good! The permutation test finds 11 microbes that distinguish the two groups well. Let's check the overlap with the t-test results:

intersect(selectedMicrobes.ttest , selectedMicrobes.perm)
## [1] "Unc03y4v" "Unc36622" "Unc38399" "Unc00y95"

So the permutation test recovers four of the five t-test hits (only Unc81447 is missing), i.e. the two methods largely agree. Let's see whether k-means after PCA can separate the samples using the union of the two sets.

selectedMicrobes <- union(selectedMicrobes.perm , selectedMicrobes.ttest)
orderedDataSet.selected <- orderedDataSet[selectedMicrobes , ]
pca <- prcomp(orderedDataSet.selected , center = F , scale. = F)
# samples are the columns (variables) here, so plot their loadings
pcr <- data.frame(pca$rotation[,1:3] , group = groups$isPatient)
ggplot(pcr) + 
  geom_point(aes(x = PC1 , y = PC2, color = group)) + theme_bw()

It is still not satisfying. Next, let's check logistic regression performance for the selected microbes versus the phenotype.

t.orderedDataSet.selected <- as.data.frame(t(orderedDataSet.selected))
t.orderedDataSet.selected$isPatient <- groups$isPatient
fit <- glm(isPatient ~ . , data = t.orderedDataSet.selected , family = "binomial")
summary(fit)
## 
## Call:
## glm(formula = isPatient ~ ., family = "binomial", data = t.orderedDataSet.selected)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.22049   0.01304   0.57088   0.64472   1.42915  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.7454445  0.2703480   6.456 1.07e-10 ***
## Unc03exo     0.0142904  0.0412711   0.346  0.72915    
## Unc02ruj    -0.0019878  0.0020264  -0.981  0.32662    
## Unc03uuu    -0.5589619  0.6035458  -0.926  0.35438    
## Unc01pk4    -0.0016021  0.0005301  -3.022  0.00251 ** 
## Unc05n7t    -0.0113908  0.0367798  -0.310  0.75679    
## Unc03tc5     0.0185709  0.0206657   0.899  0.36885    
## Unc03y4v     0.0042450  0.0058668   0.724  0.46933    
## Unc36622     0.0006170  0.0026200   0.235  0.81384    
## Unc38399    -0.0031163  0.0128288  -0.243  0.80807    
## Unc01zba    -0.0810254  0.1714104  -0.473  0.63643    
## Unc00y95    -0.0012842  0.0006792  -1.891  0.05866 .  
## Unc81447     0.6295148  0.4768830   1.320  0.18681    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 201.28  on 177  degrees of freedom
## Residual deviance: 169.62  on 165  degrees of freedom
## AIC: 195.62
## 
## Number of Fisher Scoring iterations: 6

Hmm. Apart from the intercept, only Unc01pk4 reaches a small p-value, so this model is not very good at distinguishing the two groups. Maybe tSNE gives a better visualization:

rtsneObj <- Rtsne::Rtsne(X = t.orderedDataSet.selected , dim = 2 , perplexity = 5 , check_duplicates = F, 
                         pca_center = F , pca_scale = F)
tsneCoords <- as.data.frame(rtsneObj$Y)
colnames(tsneCoords) <- c("X" , "Y")
tsneCoords$group <- groups$isPatient
ggplot(as.data.frame(tsneCoords)) + 
  geom_point(aes(x = X , y = Y, color = group)) + 
  ggtitle("tSNE on Microbes")

It is not amazing, but it is better than the earlier attempts, so I think we selected stronger signals: the full feature set contained a lot of noise that hurt the embedding. Let's normalize (center and scale) the data and try tSNE again:

rtsneObj <- Rtsne::Rtsne(X = t.orderedDataSet.selected , dim = 2 , perplexity = 5 , check_duplicates = F, 
                         pca_center = T , pca_scale = T)
tsneCoords <- as.data.frame(rtsneObj$Y)
colnames(tsneCoords) <- c("X" , "Y")
tsneCoords$group <- groups$isPatient
result <- ggplot(as.data.frame(tsneCoords)) + 
  geom_point(aes(x = X , y = Y, color = group)) + 
  ggtitle("tSNE on Microbes")
result

Wow, much better! I think we can separate the samples now. Although the normal samples are not tightly grouped, the patient samples sit clearly apart from them, so a simple k-means should make fewer labeling mistakes:

kCluster <- kmeans(t.orderedDataSet.selected , centers = 2  , nstart = 20 , iter.max = 10000)
kCluster$cluster
##   206615   206614   206617   206619   206616   206618   206621   206622 
##        1        1        1        1        1        1        1        1 
##   206620   206624   206623   206626   206625   206628   206627   206630 
##        1        1        1        1        1        1        1        1 
##   206629   206636   206635   206644   206643   206645   206646   206648 
##        1        1        1        1        1        1        1        1 
##   206647   206656   206655   206659   206660   206667   206668   206670 
##        1        1        1        1        1        1        1        1 
##   206669   206671   206672 206672.1   206673   206676   206675   206677 
##        1        1        1        1        1        1        1        1 
##   206678   206681   206682   206684   206683   206695   206700   206757 
##        1        1        1        1        1        1        1        1 
##   206758   206702   206701   206708   206709   206746   206747   206713 
##        1        1        1        1        1        1        1        1 
##   206712   206720   206721   206723   206724   206728   206727   206732 
##        1        1        1        1        1        1        1        1 
##   206731   206754   206734   206733   206750   206751   206756   206755 
##        1        1        1        1        1        1        1        1 
##   206762   206761   206767   206766   206769   206768   206770   224330 
##        1        1        1        1        1        1        1        1 
##   224327   224328   224325   224326   206536   206538   206534   215050 
##        1        1        1        1        1        1        1        1 
##   215023   206548   206547   206562   206561   206564   206563   206570 
##        1        1        1        1        1        1        1        1 
##   206569   215067   206571   206603   206572   206604   206605   206608 
##        1        1        1        1        1        1        1        1 
##   206609   215085   215084   214995   214994   215061   215062   215075 
##        1        1        1        1        1        1        1        1 
##   215076   215080   219633   219634   219637   219638   219643   219644 
##        1        1        1        1        1        1        1        1 
##   219646   219645   219659   219653   219654   219693   219668   219670 
##        1        1        1        1        1        1        1        1 
##   219669   219675   219676   219691   219692   206657   206658   206704 
##        1        1        1        1        1        1        1        1 
##   206703   206753   206752   206711   206710   206719   206726   206725 
##        1        1        1        1        1        2        1        1 
##   206730   206729   206739   206738   206741   206740   206742   206743 
##        1        1        1        1        2        2        1        1 
##   224324   224323   215004   215003   215008   215007   215010   215009 
##        1        1        1        1        1        1        1        1 
##   215049   215048   215056   215055   215058   215057   215074   215077 
##        1        1        1        1        1        1        1        1 
##   222170   222171   224844   224845   219652   219651   219656   219655 
##        1        1        1        1        1        1        1        1 
##   219657   219658 
##        1        1
sum(kCluster$cluster == 2)
## [1] 3
sum(kCluster$cluster == 1)
## [1] 175

It is not satisfying: k-means puts 175 samples in one cluster and only 3 in the other, while the true split is 133 samples in one class versus 45 in the other. So we should try another clustering algorithm.

Normalized cut

We can also use a clustering technique that works on a graph representation of the data. The idea is: 1- reduce dimensionality with diffusion maps; 2- split the points on the first diffusion component, DC1. Because of the structure of the eigenvectors, samples on opposite sides of DC1 \(= 0\) can be assigned to different groups. This can work well for clustering, but we need to inspect the results before choosing it.
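The sign-splitting idea can be sketched without destiny, using a plain normalized graph Laplacian in base R (a toy illustration on two made-up Gaussian blobs, not the real data): the sign of the eigenvector belonging to the second-smallest eigenvalue (the Fiedler vector, playing the role of DC1 here) recovers the two groups.

```r
set.seed(3)
# two hypothetical 2-D blobs of 20 points each
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))
W <- exp(-as.matrix(dist(pts))^2 / 4)          # Gaussian affinity graph
d <- rowSums(W)
Lsym <- diag(nrow(W)) - diag(1/sqrt(d)) %*% W %*% diag(1/sqrt(d))
ev <- eigen(Lsym, symmetric = TRUE)
fiedler <- ev$vectors[, ncol(W) - 1]           # second-smallest eigenvalue
cluster <- ifelse(fiedler > 0, 1, 2)           # split on the sign
table(cluster, truth = rep(1:2, each = 20))
```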

dm2 <- DiffusionMap(data = t(as.matrix(orderedDataSet)) , n_eigs = 2, density_norm = F , distance = "euclidean", k = 20)
## Warning in dataset_extract_doublematrix(data, vars): Duplicate rows removed
## from data. Consider explicitly using `df[!duplicated(df), ]`
# DiffusionMap removed a duplicated sample, so drop the same duplicated
# column before matching samples to their groups

tmp <- orderedDataSet[ , !duplicated(t(as.matrix(orderedDataSet)))]
groupTmp <- groups %>% filter(`External ID` %in% colnames(tmp))
groupTmp <- groupTmp[!duplicated(groupTmp),]
dcf <- data.frame(DC1 = eigenvectors(dm2)[,1] , DC2 = eigenvectors(dm2)[,2] ,
                  name = colnames(tmp) , gr = groupTmp$isPatient)

p <- ggplot(dcf) + 
  geom_point(aes(x = DC1 , y = DC2 , color = gr)) 
p

Not satisfying either; let's subset the data down to the selected microbes and try again:

dm2 <- DiffusionMap(data = t.orderedDataSet.selected , n_eigs = 2, density_norm = F , distance = "euclidean", k = 20)
## Warning in dataset_extract_doublematrix(data, vars): Duplicate rows removed
## from data. Consider explicitly using `df[!duplicated(df), ]`
# again remove duplicated samples (rows here, since this matrix is transposed)

tmp <- t.orderedDataSet.selected[!duplicated(t.orderedDataSet.selected),]
groupTmp <- groups %>% filter(`External ID` %in% rownames(tmp))
groupTmp <- groupTmp[!duplicated(groupTmp),]
dcf <- data.frame(DC1 = eigenvectors(dm2)[,1] , DC2 = eigenvectors(dm2)[,2] ,
                  name = rownames(tmp) , gr = groupTmp$isPatient)

p <- ggplot(dcf) + 
  geom_point(aes(x = DC1 , y = DC2 , color = gr)) 
p

So diffusion maps do not seem to be a good fit for this task either.

So I think the list of responsible signals is:

selectedMicrobes
##  [1] "Unc03exo" "Unc02ruj" "Unc03uuu" "Unc01pk4" "Unc05n7t" "Unc03tc5"
##  [7] "Unc03y4v" "Unc36622" "Unc38399" "Unc01zba" "Unc00y95" "Unc81447"

These microbes should be validated in the laboratory, but as a data-analysis result I think this is a good candidate list.

Conclusion

I tried many approaches and changed the workflow several times to find the responsible features (microbes). I tried several normalization techniques, such as the standard centering and scaling used in PCA and my own square-root l2-norm method. I then tried multiple models, including a linear model and logistic regression, and used dimension-reduction techniques such as PCA and tSNE to visualize the data and check whether a simple k-means could separate the groups afterwards. I also tried a normalized-cut-style approach, implemented simply with diffusion maps. None of these models was very useful on its own. Finally, multiple hypothesis testing worked: I ran a t-test and a permutation test, adjusted the p-values with the FDR method, and merged the two microbe sets obtained by selecting adjusted p-values below 0.05. I then revisited the visualization with these selected features. The first attempt, without normalization, was unsuccessful, but the next attempt, with the samples centered and scaled before tSNE, was satisfying. The resulting plot is:

result

In this plot, I think we can distinguish normal from patient samples: normal samples are drawn in black and patient samples in blue. Although the patient samples are not tightly grouped, the two classes are separable, so I believe the selected microbes are good features for distinguishing normal from patient samples.